


The Ring of Algebraic Functions on 
Persistence Bar Codes 



Aaron Adcock    Erik Carlsson    Gunnar Carlsson *

April 3, 2013

1 Introduction

Persistent homology ([3], [13]) is a fundamental tool in the area of
computational topology. It can be used to infer topological structure in
data sets (see [1], [?]), but variations on the method can be applied to
study aspects of the shape of point clouds which are not overtly
topological ([5], [8]). The methodology assigns to any finite metric space
(such as are typically obtained in experimental data of various kinds) and
non-negative integer k a bar code, by which we will mean a finite
collection of intervals with endpoints on the real line. The integer k
specifies a dimension of a feature (zero-dimensional for a cluster,
one-dimensional for a loop, etc.), and an interval represents a feature
which is "born" at the value of a parameter (the persistence parameter)
given by the left hand endpoint of the interval, and which "dies" at the
value given by the right hand endpoint. These barcodes have been
demonstrated to identify structure in spaces of image patches in [1] and
[4], and to distinguish between hand-drawn letters in [8]. Because of the
unusual structure of the invariant, i.e. as a collection of intervals
rather than numerical quantities, the method currently requires
substantial knowledge of topological methods. It would clearly be useful to
assign and interpret various numerical quantities attached to bar codes, so
that these outputs could be used as input to standard algorithms within
machine learning, cluster analysis, and other methods. It is the purpose of
this paper to identify an algebra of functions on the set of bar codes
which is defined in a conceptually coherent way.

The main idea is the following. A bar code with exactly n intervals
can be specified by a vector $(x_1, y_1, x_2, y_2, \ldots, x_n, y_n)$, where
$x_i$ denotes the left endpoint of the i-th interval and $y_i$ the right
endpoint. However, this representation is many to one, in that the bar code
structure does not retain the ordering on the intervals. In fact, the set
of bar codes with exactly n intervals can be identified with the set

\[ Sp^n(\mathbb{R}^2), \]

the n-fold symmetric product of $\mathbb{R}^2$. For any set X, $Sp^n(X)$ is
defined to be the orbit space of the action of the symmetric group on n
letters on the product $X^n$ given by permuting the coordinates. On the
other hand, the space $(\mathbb{R}^2)^n$ is an algebraic variety over
$\mathbb{R}$ ([10]). In fact, it is an affine space of dimension 2n, and the
symmetric group action mentioned above is an algebraic action. It is then
known (see [12]) that the orbit space inherits the structure of an
algebraic variety, and the elements of its affine coordinate ring ([10])
are functions on the set of bar codes with exactly n intervals. These
affine coordinate rings are well known algebras referred to generically as
rings of multisymmetric polynomials ([9]). They can be quite complicated,
since it turns out that any set of algebra generators for them will satisfy
non-trivial relations or syzygies. It turns out, though, that there are
inclusions of algebraic varieties

\[ Sp^n(\mathbb{R}^2) \hookrightarrow Sp^{n+1}(\mathbb{R}^2) \qquad (1) \]

which produce an inverse system of graded affine coordinate rings

\[ \cdots \to A[Sp^{n+1}(\mathbb{R}^2)] \to A[Sp^n(\mathbb{R}^2)] \to \cdots \]

whose inverse limit we will denote, by abuse of notation, by
$A[Sp^\infty(\mathbb{R}^2)]$. The notation $A[-]$ denotes the affine
coordinate ring. This algebra is known to be freely generated on a set of
minimal algebra generators ([9]).

The analysis of the system (1) above is not sufficient, though. This system
identifies a point $((x_1, y_1), \ldots, (x_n, y_n)) \in Sp^n(\mathbb{R}^2)$
with the point

\[ ((x_1, y_1), \ldots, (x_n, y_n), (0, 0)) \in Sp^{n+1}(\mathbb{R}^2). \]

In other words, a set S of n intervals is identified with the set of n + 1
intervals obtained by adjoining the interval of length zero whose two
endpoints are zero. However, in the parametrization of the isomorphism
classes of persistence vector spaces in [13] by barcodes, any interval of
length zero is identified with the zero module. So, we would like to
determine the ring of all algebraic functions (i.e. the elements of
$A[Sp^\infty(\mathbb{R}^2)]$) which have the property that they take the
same value on any barcode as on the result of adjoining any interval of
length zero to it. In this paper, we will identify this subring, describe
its structure, and describe the algebra generators explicitly so that they
can be used effectively by those interested in analyzing databases of
shapes.

2 The Ind-scheme $\mathfrak{B}$

We first discuss the set of bar codes, without any algebraic variety
structures. For every n, we first consider the set of bar codes containing
exactly n intervals. We will permit intervals of length zero. The set of
intervals $\mathfrak{I}$ can be identified with the subset of $\mathbb{R}^2$
consisting of pairs $(x, y)$ with $x \le y$. A bar code containing n
intervals is therefore identified with a point of the n-fold symmetric
product $Sp^n(\mathfrak{I})$, where for any set X, $Sp^n(X)$ is defined to
be the orbit space of the action of the symmetric group on n letters on the
product $X^n$ given by permuting the coordinates. One can assemble these
sets into a directed system

\[ \mathfrak{I} \to Sp^2(\mathfrak{I}) \to Sp^3(\mathfrak{I}) \to Sp^4(\mathfrak{I}) \to \cdots \]

where the maps $i_n : Sp^n(\mathfrak{I}) \to Sp^{n+1}(\mathfrak{I})$ are
given by

\[ i_n(\{I_1, \ldots, I_n\}) = \{I_1, \ldots, I_n, [0, 0]\}. \]

The direct limit of this system will be denoted by $Sp^\infty(\mathfrak{I})$.
We are interested in studying functions on $Sp^\infty(\mathfrak{I})$. Such a
function can be identified with an infinite vector $(f_1, f_2, f_3, \ldots)$
of functions $f_n : Sp^n(\mathfrak{I}) \to \mathbb{R}$ satisfying the
compatibility condition

\[ f_{n+1} \circ i_n = f_n. \]

The set of all such vectors of functions forms a ring $\mathfrak{R}$ under
coordinatewise addition and multiplication. It is not exactly what we want,
however. The reason is that under the parametrization of persistence vector
spaces as described in [3] and [13], intervals of length zero correspond to
zero vector spaces, and therefore all intervals of length zero should be
considered equal. This means that we should consider only functions
$F : Sp^\infty(\mathfrak{I}) \to \mathbb{R}$ for which

\[ F(\{I_1, I_2, \ldots, I_n, [\xi, \xi]\}) = F(\{I_1, I_2, \ldots, I_n, [\eta, \eta]\}) \]

for all possible values of $\xi$ and $\eta$. The set of all such functions
is a subring $\mathfrak{R}' \subseteq \mathfrak{R}$. This set of functions
can be defined as the set of all functions on the set $\mathfrak{B}$ defined
by

\[ \mathfrak{B} = \coprod_n Sp^n(\mathfrak{I}) \Big/ \simeq \]

where $\simeq$ is the equivalence relation generated by all relations of the
form

\[ \{I_1, I_2, \ldots, I_n, [\xi, \xi]\} \simeq \{I_1, I_2, \ldots, I_n\}. \]
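The quotient construction above can be made concrete with a small sketch.
The following Python fragment is an illustration only, not part of the
original construction: the names `canonical_barcode` and `same_point` are
hypothetical, and exact floating point comparison is assumed to be
acceptable for the inputs. It picks one representative for each point of
the quotient by forgetting the order of the intervals and discarding every
interval of length zero.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]


def canonical_barcode(intervals: Iterable[Interval]) -> Tuple[Interval, ...]:
    """Return a canonical representative of a bar code in the quotient.

    Sorting forgets the ordering of the intervals (symmetric product), and
    dropping intervals with x == y applies the equivalence relation that
    identifies every zero-length interval with no interval at all.
    """
    kept = []
    for x, y in intervals:
        if x > y:
            raise ValueError("an interval needs x <= y")
        if x < y:  # keep only intervals of positive length
            kept.append((x, y))
    return tuple(sorted(kept))


def same_point(a: Iterable[Interval], b: Iterable[Interval]) -> bool:
    """Two lists of intervals represent the same point of the quotient."""
    return canonical_barcode(a) == canonical_barcode(b)
```

Any function that factors through `canonical_barcode` automatically takes
the same value on a bar code and on the result of adjoining a zero-length
interval to it.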

Remark: The reader may suggest that one consider instead only the
subset $\mathfrak{I}^+ \subseteq \mathfrak{I}$ consisting of intervals of
positive length. This will produce a disjoint union of sets of barcodes,
partitioned into the sets containing a fixed positive number of intervals
of positive length. Such a description does not take into account the fact
that we would like to topologize the space of all bar codes in such a way
that

\[ \lim_{\epsilon \to 0} \{I_1, I_2, \ldots, I_n, [x_{n+1}, x_{n+1} + \epsilon]\} = \{I_1, I_2, \ldots, I_n\}. \]

The reason for this is that small perturbations to the input data to the
persistence algorithms can modify the barcodes by modifying lengths of
intervals a small amount and adding intervals of small length. This is the
stability theorem for persistence diagrams proved in [?].

The ring of functions $\mathfrak{R}'$ is too large to deal with
effectively. Even the much smaller ring of continuous functions on
$\mathfrak{B}$ is still too complex to describe completely. We will observe
that $\mathfrak{B}$ is described as a colimit of algebraic varieties, and
that it is therefore possible to define the ring of algebraic functions on
$\mathfrak{B}$. It is this ring we will analyze.

Throughout this paper, k will denote the field $\mathbb{R}$. All varieties
will be over k. We consider the affine space $\mathfrak{A}_n = \mathbb{A}(n)$
of dimension 2n, parametrized with coordinates
$(x_1, y_1, x_2, y_2, \ldots, x_n, y_n)$. Its affine coordinate ring is the
polynomial ring $B_n = k[x_1, y_1, \ldots, x_n, y_n]$. There is an action of
the symmetric group $S_n$ on n letters on $\mathfrak{A}_n$, and from [12] it
follows that the set of orbits on the set of points of the variety is
itself an affine algebraic variety, with affine coordinate ring equal to
the invariant subring $B_n^{S_n}$. Let $W_i \subseteq \mathfrak{A}_n$ denote
the subvariety $y_i - x_i = 0$. We let $D_n \subseteq B_n$ denote the
subring of functions whose restriction to $W_i$ is independent of $x_i$ for
all i. We wish to characterize this subring algebraically.

Proposition 1. The ring $D_n$ is characterized algebraically as the subring
of all f for which

\[ \left( \frac{\partial}{\partial x_i} + \frac{\partial}{\partial y_i} \right) f \in (y_i - x_i) \]

for all i.

Proof: We fix i, and consider all the functions f for which $f|_{W_i}$ is
independent of $x_i$ (and therefore $y_i$). The operator
$\frac{\partial}{\partial x_i} + \frac{\partial}{\partial y_i}$ induces a
differential operator on the quotient ring $Q_i = B_n/(y_i - x_i)$, which is
identified with the partial differential operator
$\frac{\partial}{\partial x_i}$ in the polynomial ring obtained by omitting
$y_i$,

\[ Q_i \cong k[x_1, y_1, \ldots, x_i, \ldots, x_n, y_n]. \]

The requirement is that the image $\bar{f}$ of f in $Q_i$ is independent of
$x_i$, and this is equivalent to the condition
$\frac{\partial}{\partial x_i}(\bar{f}) = 0$. This condition is to hold for
each i, which gives the result. □
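A small worked example may help fix the criterion of Proposition 1. It is
an illustration added here, with n = 1 and with the test functions f and g
chosen for this purpose rather than taken from the text.

```latex
% n = 1; write x = x_1, y = y_1, and W_1 = \{ y - x = 0 \}.
f = x(y - x): \quad
\Bigl( \tfrac{\partial}{\partial x} + \tfrac{\partial}{\partial y} \Bigr) f
  = (y - 2x) + x = y - x \in (y - x),
\qquad f|_{W_1} = 0 .
\\[4pt]
g = x: \quad
\Bigl( \tfrac{\partial}{\partial x} + \tfrac{\partial}{\partial y} \Bigr) g
  = 1 \notin (y - x),
\qquad g|_{W_1} = x \ \text{depends on } x .
```

So f lies in $D_1$ while g does not, in agreement with the requirement that
the restriction to $W_1$ be independent of x.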



3 The ring of algebraic functions on $\mathfrak{B}$

We begin by changing coordinates via the formulae $\xi_i = x_i + y_i$ and
$\eta_i = y_i - x_i$. It is clear that $B_n$ can also be identified with
$k[\xi_1, \eta_1, \ldots, \xi_n, \eta_n]$, and that the symmetric group in
the new coordinate system permutes the $\xi_i$'s and $\eta_i$'s. Under this
transformation, the operator
$\frac{\partial}{\partial x_i} + \frac{\partial}{\partial y_i}$ is carried
into the operator $2\frac{\partial}{\partial \xi_i}$. This means that the
ring $D_n$ is identified with the subring of functions
$f(\xi_1, \eta_1, \ldots, \xi_n, \eta_n)$ for which

\[ \frac{\partial f}{\partial \xi_i} \in (\eta_i) \]

for all i.

Proposition 2. A k-basis for $D_n$ is given by the set of monomials

\[ \xi_1^{a_1} \xi_2^{a_2} \cdots \xi_n^{a_n} \eta_1^{b_1} \eta_2^{b_2} \cdots \eta_n^{b_n} \]

for which $a_i > 0$ implies $b_i > 0$.

Proof: We note that the operator $\partial/\partial \xi_i$ carries each
monomial to a constant multiple of a single monomial, namely the monomial
obtained by decreasing $a_i$ by one. Moreover, containment in the ideal
$(\eta_i)$ is also given purely by conditions on monomials, i.e. that
$b_i > 0$. We conclude that $D_n$ is spanned by monomials lying in $D_n$.
But it is clear that a monomial $\mu$ lies in $D_n$ exactly if it is the
case that whenever $\xi_i$ divides $\mu$, then $\eta_i$ also divides $\mu$.
This corresponds to the above numerical condition on the exponents in the
monomial. □
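The monomial condition of Proposition 2 is easy to enumerate by brute
force. The sketch below is illustrative only (the function name
`basis_exponents` is hypothetical): it lists the exponent vectors
$(a_1, b_1, \ldots, a_n, b_n)$ of a given total degree for which $a_i > 0$
implies $b_i > 0$.

```python
from itertools import product


def basis_exponents(n: int, degree: int):
    """Exponent vectors (a_1, b_1, ..., a_n, b_n) of the basis monomials.

    A vector is kept when its entries sum to `degree` and, for every i,
    a_i > 0 implies b_i > 0 (xi_i may only appear together with eta_i).
    """
    result = []
    for exps in product(range(degree + 1), repeat=2 * n):
        if sum(exps) != degree:
            continue
        pairs = zip(exps[0::2], exps[1::2])  # (a_i, b_i) for each i
        if all(a == 0 or b > 0 for a, b in pairs):
            result.append(exps)
    return result
```

For n = 1 and degree 1 this returns only `(0, 1)`, i.e. the monomial
$\eta_1$: the monomial $\xi_1$ is excluded because its restriction to
$\eta_1 = 0$ still depends on $\xi_1$.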

The symmetric group action clearly preserves the subring $D_n$. Moreover,
it preserves the basis of monomials within $D_n$. Let
$\{\mu_\alpha\}_{\alpha \in A}$ denote a set of orbit representatives of the
$S_n$-action on the set of monomials defined in Proposition 2. Let
$\sigma_\alpha$ denote the sum of all the elements in the orbit of
$\mu_\alpha$.

Proposition 3. We let $D_n^{S_n}$ denote the subring of elements of $D_n$
which are invariant under the action of $S_n$. Then the elements
$\sigma_\alpha$ form a k-basis of $D_n^{S_n}$.

Proof: This result plainly holds for any algebra over a field of
characteristic zero on which there is a G-action which preserves a basis of
monomials. □

We have restriction maps $\pi_{n,m} : D_n \to D_m$, when $n > m$, defined by
$\pi_{n,m}(\xi_i) = \xi_i$ (resp. $\eta_i$) for $i \le m$, and
$\pi_{n,m}(\xi_i) = 0$ (resp. $\eta_i$) for $i > m$. The map $\pi_{n,m}$ is
$S_m$-equivariant, where $S_m$ acts by permuting the first m pairs of
variables. It follows that we may construct composites

\[ D_n^{S_n} \hookrightarrow D_n^{S_m} \xrightarrow{\pi_{n,m}} D_m^{S_m} \]

which we denote by $\sigma_{n,m}$, and therefore the inverse system

\[ \cdots \to D_n^{S_n} \xrightarrow{\sigma_{n,n-1}} D_{n-1}^{S_{n-1}} \to \cdots \xrightarrow{\sigma_{2,1}} D_1^{S_1}. \]

We will denote the inverse limit of this graded system by $\mathfrak{D}$.

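We next recall some of the notation and basic facts about multisymmetric
polynomials, which can be found in Dalbec [9]. Let $R_{n,r}$ be the
polynomial ring in nr variables,

\[ R_{n,r} = k[x_{11}, \ldots, x_{1r}, \ldots, x_{n1}, \ldots, x_{nr}], \]

and let

\[ \Lambda_{n,r} = R_{n,r}^{S_n} \]

denote the ring of $S_n$ invariants, where the symmetric group acts
diagonally. There is an inverse system parallel to the one constructed
above involving the rings $\Lambda_{n,r}$. We have evaluation maps

\[ \pi_{n,m} : R_{n,r} \to R_{m,r} \]

defined by setting $x_{ij} = 0$ if $i > m$. The map $\pi_{n,m}$ is
$S_m$-equivariant, when $S_m \subseteq S_n$ is the subgroup of permutations
of the first m elements of the set $\{1, \ldots, n\}$. We have the
composites

\[ \Lambda_{n,r} = R_{n,r}^{S_n} \hookrightarrow R_{n,r}^{S_m} \xrightarrow{\pi_{n,m}} R_{m,r}^{S_m} = \Lambda_{m,r} \]

which we denote by $\rho_{n,m}$. The inverse limit of the system

\[ \cdots \xrightarrow{\rho_{n+1,n}} \Lambda_{n,r} \xrightarrow{\rho_{n,n-1}} \Lambda_{n-1,r} \xrightarrow{\rho_{n-1,n-2}} \cdots \xrightarrow{\rho_{2,1}} \Lambda_{1,r} \]

as graded rings will be denoted by $\Lambda_r$, and referred to as the ring
of r-multisymmetric functions. Its grading is given by

\[ \Lambda_r = \bigoplus_k \Lambda_r^k \]

induced by the grading on $R_{n,r}$. There is an evident embedding
$\mathfrak{D} \hookrightarrow \Lambda_2$. We will use this embedding to
identify the structure of $\mathfrak{D}$.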

The ring of multisymmetric functions has several interesting sets of
generators. Given some nonzero vectors

\[ \alpha_i = (a_{i1}, \ldots, a_{ir}) \in \mathbb{N}^r \setminus \{0\}, \]

we define the multisymmetric monomials by

\[ \mathrm{Sym}\Bigl( \prod_i x_{i1}^{a_{i1}} x_{i2}^{a_{i2}} \cdots x_{ir}^{a_{ir}} \Bigr), \]

where Sym is the symmetrization map. Sym applied to a monomial yields
the sum of all monomials which are in the orbit of the $S_n$-action.

They form a vector space basis of $\Lambda_{n,r}$, for any n. It is known
that $\Lambda_{n,r}$ is generated as an algebra by the symmetrizations of
monomials involving only $\{x_{11}, x_{12}, \ldots, x_{1r}\}$. They are
given by the formulae

\[ p_{a_1, \ldots, a_r} = \sum_i x_{i1}^{a_1} x_{i2}^{a_2} \cdots x_{ir}^{a_r} \]

and are called the multisymmetric power sums. While there are relations
among the power sums in finitely many variables, they freely generate the
inverse limit $\Lambda_r$, making it a polynomial algebra. See [9] for
details.
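A multisymmetric power sum is simple to evaluate numerically. The following
sketch is illustrative only (the function name `multisymmetric_power_sum` is
hypothetical): a point of the symmetric product is represented as a list of
r-tuples, and the exponent vector as a tuple of length r.

```python
from math import prod


def multisymmetric_power_sum(points, alpha):
    """Evaluate the power sum with exponent vector `alpha` on `points`.

    points: list of r-tuples (x_i1, ..., x_ir), one tuple per index i.
    alpha:  exponents (a_1, ..., a_r); the value is
            sum_i x_i1**a_1 * ... * x_ir**a_r.
    """
    return sum(prod(v ** a for v, a in zip(point, alpha)) for point in points)
```

Because the sum runs over all points, the value does not depend on the
order in which they are listed. With r = 2 and points given as (x, y)
pairs, `multisymmetric_power_sum(bars, (a, b))` computes $\sum_i x_i^a y_i^b$.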
We will be interested in the case r = 2. Let us set

\[ x_i = x_{i1}, \qquad y_i = x_{i2}, \]

and let

\[ A_n = R_{n,2} = k[x_1, y_1, \ldots, x_n, y_n]. \]

The subalgebra $\mathfrak{D} \subseteq \Lambda_2$ now has the following
characterization.

Theorem 1. As a subalgebra of $\Lambda_2$, $\mathfrak{D}$ is freely
generated by elements of the form $p_{a+1,b} - p_{a,b+1}$.
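The generators of Theorem 1 can be evaluated directly on a bar code, since
$p_{a+1,b} - p_{a,b+1} = \sum_i x_i^a y_i^b (x_i - y_i)$. The sketch below
is an illustration under that identity; the function names and the ordering
of the output are choices made here, not taken from the text.

```python
def generator(barcode, a, b):
    """Value of p_{a+1,b} - p_{a,b+1} on a list of (x, y) intervals."""
    return sum((x ** a) * (y ** b) * (x - y) for x, y in barcode)


def generator_features(barcode, max_degree):
    """Values of all generators of degree at most `max_degree`.

    The generator indexed by (a, b) has degree a + b + 1, so there are
    exactly k generators in degree k. They are listed by degree, then by a.
    """
    features = []
    for degree in range(1, max_degree + 1):
        for a in range(degree):
            b = degree - 1 - a
            features.append(generator(barcode, a, b))
    return features
```

Each summand carries the factor $x_i - y_i$, so an interval of length zero
contributes nothing: for example, `generator_features(bars, 3)` is unchanged
when `(4, 4)` is appended to `bars`.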

Checking that these generators are contained in $\mathfrak{D}$, and that
there are no relations between them is easy. The work is in calculating
Hilbert series,

\[ P(\Omega) = \sum_{k \ge 0} \dim_{\mathbb{R}}(\Omega^k) \, t^k \]

with induced grading on $\Omega$. Here $\Omega$ denotes the ring
$\mathfrak{D}$, and in the lemmas below $B_n$ denotes the subring of $A_n$
written $D_n$ above. We do this in the following lemmas.

Lemma 4. An $\mathbb{R}$-basis for $B_n$ is given by the set of monomials

\[ (x_1 - y_1)^{a_1} y_1^{b_1} \cdots (x_n - y_n)^{a_n} y_n^{b_n} \]

for which $b_i > 0$ implies $a_i > 0$.

Proof: Make the substitution $z_i = x_i - y_i$, which corresponds to an
isomorphism

\[ A_n \cong \mathbb{R}[z_1, y_1, \ldots, z_n, y_n], \]

and notice that $B_n$ consists exactly of the f for which
$\partial f / \partial y_i$ lies in the ideal $(z_i)$ for all i. The
operator $\partial/\partial y_i$ carries each monomial to a constant times
a single monomial, namely the monomial obtained by decreasing $b_i$ by one.
Moreover, containment in the ideal $(z_i)$ is also given purely by
conditions on monomials, i.e. that $a_i > 0$. We conclude that $B_n$ is
spanned by monomials lying in $B_n$. But it is clear that a monomial $\mu$
lies in $B_n$ exactly if it is the case that whenever $y_i$ divides $\mu$,
then $z_i$ also divides $\mu$. This clearly corresponds to the above
numerical condition on the exponents in the monomial. □

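Lemma 5. The Hilbert series of $\Omega$ is

\[ P(\Omega) = \prod_{d \ge 1} (1 - t^d)^{-d}. \]

Proof: Lemma 4 shows that $B_n$ has a basis of monomials which is preserved
by the $S_n$-action. Whenever this is true, the orbit sums of any set of
orbit representatives constitute a basis of $B_n^{S_n}$ over a field of
characteristic zero. We define such a set of representatives by the
monomials of the form

\[ z_1^{a_1} y_1^{b_1} z_2^{a_2} y_2^{b_2} \cdots z_l^{a_l} y_l^{b_l}, \qquad (a_i, b_i) = \varphi_{j_i}, \quad j_1 \ge j_2 \ge \cdots \ge j_l, \]

where $l \le n$, and $\varphi : \mathbb{N}_+ \to \mathbb{N}_+ \times \mathbb{N}$
is the bijection

\[ (\varphi_1, \varphi_2, \ldots) = ((1, 0), (1, 1), (2, 0), (1, 2), (2, 1), (3, 0), (1, 3), \ldots) \]

onto the set of possible nonzero exponents. The dimension of the k-th
graded component of $B_n^{S_n}$ is just the number of these monomials of
degree k.

Let us say that $(a, b) < (c, d)$ when
$\varphi^{-1}(a, b) < \varphi^{-1}(c, d)$, and let $f(a, b, k)$ denote the
number of sequences $(a_1, b_1, \ldots, a_l, b_l)$ of total degree
$\sum_i (a_i + b_i) = k$ such that

\[ (a, b) \ge (a_1, b_1) \ge \cdots \ge (a_l, b_l), \qquad (a_i, b_i) \in \mathbb{N}_+ \times \mathbb{N}, \]

with no restrictions on l. It is easy to check that it satisfies the
recursion relation

\[ f(a, b, k) = \sum_{(c, d) \le (a, b)} f(c, d, k - c - d), \]

which leads to the formula

\[ \sum_{k \ge 0} f(a, b, k) \, t^k = (1 - t^{a+b})^{-a} \prod_{1 \le d \le a+b-1} (1 - t^d)^{-d}. \]

We then have

\[ \lim_{n \to \infty} P(B_n^{S_n}) = \lim_{(a, b) \to \infty} \sum_{k \ge 0} f(a, b, k) \, t^k = \prod_{k \ge 1} (1 - t^k)^{-k}. \]

□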
We can now prove the theorem.

Proof. Let

\[ \Omega' = \mathbb{R}[p_{10} - p_{01}, \; p_{11} - p_{20}, \; \ldots]. \]

It is simple to check that $\Omega' \subseteq \Omega$. There are no
relations between these generators because the homomorphism

\[ \Lambda_2 \to \Lambda_2, \qquad p_{a+1,b} \mapsto p_{a+1,b} - p_{a,b+1}, \qquad p_{0,b} \mapsto p_{0,b} \]

is an isomorphism. It remains to show that the two rings have the same
Hilbert series. But $P(\Omega')$ obviously equals the generating function
in Lemma 5 because there are k generators in degree k. □
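The counting step at the end of the proof can be checked numerically in low
degrees. The sketch below is an illustration added here, with hypothetical
function names: it expands the product $\prod_{d \ge 1} (1 - t^d)^{-d}$ as a
truncated power series and compares the coefficients with a brute-force
count of monomials in free generators, k of them in each degree k.

```python
def hilbert_coefficients(max_degree):
    """Coefficients of prod_{d>=1} (1 - t^d)^(-d) up to t^max_degree."""
    coeffs = [1] + [0] * max_degree
    for d in range(1, max_degree + 1):
        for _ in range(d):  # multiply by 1 / (1 - t^d), d times
            for k in range(d, max_degree + 1):
                coeffs[k] += coeffs[k - d]
    return coeffs


def count_monomials(degree):
    """Count monomials of a given degree in free generators g_{a,b}.

    The generator g_{a,b} (a, b >= 0) is given degree a + b + 1, matching
    p_{a+1,b} - p_{a,b+1}. Only generators of degree <= `degree` matter.
    """
    gens = sorted(a + b + 1 for a in range(degree) for b in range(degree - a))

    def count(remaining, start):
        # Multisets of generators from index `start` on, with degree sum
        # equal to `remaining`.
        if remaining == 0:
            return 1
        total = 0
        for i in range(start, len(gens)):
            if gens[i] > remaining:
                break
            total += count(remaining - gens[i], i)
        return total

    return count(degree, 0)
```

For example, in degree 2 there are three monomials: the square of the
degree 1 generator and the two generators of degree 2.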



4 Machine Learning on $\mathfrak{B}$ with examples

4.1 Digits Example 

To illustrate the classification potential of this technique, we apply it to the
MNIST database [11] of handwritten digits. We emphasize that the aim is
not to outperform existing machine learning algorithms for digit classifica- 
tion, but to present an example that demonstrates one way of combining this 
technique with existing machine learning techniques. While it is clear that 
pure topological classification cannot distinguish between the digits (there 
are three numbers that do not have any loops, three that always have loops, 
one that has two loops and three that have style-dependent loops), we can 
use the power of persistent homology to sift out more information. We begin 
by showing the full analysis of a few digits and then give the empirical results 
of applying this technique to a subset of the MNIST database. 

4.1.1 Topological Methods 

We begin by describing a particular graph construction given a digital
image. We treat the pixels as vertices and add edges between adjacent
pixels (including diagonals). We can now define a filtration on the
vertices of the graph corresponding to the image pixels. A natural
filtration could be constructed using the pixel intensities of the original
image (see Figure 6 in Section 4.2). Another filtration, used in [8], can
be constructed by thresholding, to produce a binary image, and adding
1-pixels as we sweep across the image. This adds spatial information into
what would otherwise be a purely topological measurement. Since the
orientation of the digit matters (a 6 is the same as a 9 given a 180 degree
rotation), we choose the latter approach and sweep across the rows and
columns of each digit.

By taking into account spatial information, we get a rough view of the
location of various topological features. For example, though a '9' and '6'
both have one connected component and a single loop, the loop will appear
at different locations in the top-down filtration for the '9' and '6'. The
digits and one of the resulting barcodes are shown in Figures 1 and 2.
Using all four sweeps, and both the Betti 0 and Betti 1 barcodes, reveals
additional differences between each of the digits.
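The sweep filtration can be sketched in a few lines. The code below is an
illustration under stated assumptions, not the implementation used for the
experiments: the image is a binary list of rows, a "right" sweep is taken
to mean that columns enter from left to right, an edge enters when both of
its endpoints are present, and only the 0-dimensional barcode is computed
(with a union-find structure). All function names are hypothetical.

```python
from math import inf


def sweep_values(image, direction):
    """Filtration value of every foreground pixel for one sweep."""
    rows, cols = len(image), len(image[0])
    entry = {
        "right": lambda r, c: c,             # columns enter left to right
        "left": lambda r, c: cols - 1 - c,   # columns enter right to left
        "down": lambda r, c: r,              # rows enter top to bottom
        "up": lambda r, c: rows - 1 - r,     # rows enter bottom to top
    }[direction]
    return {(r, c): entry(r, c)
            for r in range(rows) for c in range(cols) if image[r][c]}


def h0_barcode(values):
    """0-dimensional barcode of the 8-connected pixel graph."""
    parent, birth, bars = {}, {}, []

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for p in sorted(values, key=lambda q: (values[q], q)):
        parent[p], birth[p] = p, values[p]
        r, c = p
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q = (r + dr, c + dc)
                if q == p or q not in parent:
                    continue
                a, b = find(p), find(q)
                if a == b:
                    continue
                # The younger component dies when two components merge.
                old, young = sorted((a, b), key=lambda root: birth[root])
                if birth[young] < values[p]:  # drop zero-length bars
                    bars.append((birth[young], values[p]))
                parent[young] = old
    roots = {find(p) for p in parent}
    bars.extend((birth[root], inf) for root in roots)
    return sorted(bars)
```

On a small U-shaped image, the top-down sweep sees two components that
merge at the bottom row, while the left-to-right sweep sees a single
component throughout, so the two sweeps give different barcodes for the
same shape.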



[Figure 1 panels: Digit 1, Digit 3, Digit 4 and Digit 7, each titled
"Right Sweep: Dimension 0"]

Figure 1: No Loop Digits with Betti 0 barcode, sweep to right

[Figure 2 panels: Digit 0, Digit 2, Digit 6, Digit 8 and Digit 9, each
titled "Up Sweep: Dimension 1"]

Figure 2: Loop Digits with Betti 1 barcode, sweep to top



4.1.2 Feature Selection 

We can use the techniques described in this paper to coordinatize the
barcode space $\mathfrak{B}$. In machine learning terminology, these
coordinates are called features. This allows us to characterize the
barcodes generated by each data point as a compact feature vector. This
also gives us great flexibility in selecting features that work well with
our data. We can then apply a standard machine learning algorithm, such as
a support vector machine (SVM), to classify the data.

We selected a set of four features from the invariants discussed in this 
paper. Intuitively, the exponents in each polynomial will give the relative 
value of small bars or endpoints compared to large bars or endpoints. For 
example, if comparing two bars of length | and 6, the first bar will have 
more weight in an invariant linear polynomial than in an invariant quadratic 
polynomial. Indeed, 
We selected four features,

\[ \sum_i x_i (y_i - x_i), \qquad \sum_i (y_{\max} - y_i)(y_i - x_i), \]

\[ \sum_i x_i^2 (y_i - x_i)^4, \qquad \sum_i (y_{\max} - y_i)^2 (y_i - x_i)^4, \]

which when applied to the four sweeps, each with a 0-dimensional and
1-dimensional barcode, gives a feature vector of total size 32 which we
then arranged into a feature matrix. Intuitively speaking, the first two
features take all of the bars, lengths and endpoints, into account. The
second two features heavily favor the arrangement of longer bars. A
visualization of a matrix of 10,000 digits using classical multidimensional
scaling (MDS) is shown in Figure 3 and the spectrum of the matrix is shown
in Figure 4.
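The assembly of the 32-dimensional feature vector can be sketched as
follows. This is an illustration under assumptions: the first two sums use
the forms legible in the text, the exponents `p` and `q` of the second pair
are left as parameters with assumed defaults, `y_max` is taken to be the
largest filtration value, infinite bars are assumed to have been truncated
at `y_max`, and the names `barcode_features`, `feature_vector` and `SWEEPS`
are hypothetical.

```python
SWEEPS = ("right", "left", "up", "down")


def barcode_features(barcode, y_max, p=2, q=4):
    """Four polynomial features of one barcode of finite (x, y) bars."""
    f1 = sum(x * (y - x) for x, y in barcode)
    f2 = sum((y_max - y) * (y - x) for x, y in barcode)
    # Higher powers of the length emphasize long bars (assumed exponents).
    f3 = sum(x ** p * (y - x) ** q for x, y in barcode)
    f4 = sum((y_max - y) ** p * (y - x) ** q for x, y in barcode)
    return [f1, f2, f3, f4]


def feature_vector(barcodes, y_max):
    """Concatenate features over 4 sweeps and dimensions 0 and 1.

    barcodes: dict mapping (sweep, dimension) to a list of (x, y) bars.
    Returns a list of 4 * 2 * 4 = 32 numbers.
    """
    vector = []
    for sweep in SWEEPS:
        for dimension in (0, 1):
            vector.extend(barcode_features(barcodes[(sweep, dimension)], y_max))
    return vector
```

Every summand contains a positive power of the length $y_i - x_i$, so bars
of length zero do not affect any of the features.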
[Figure 3 panels: (a) A 2D View of the Data; (b) A 3D View of the Data]

Figure 3: Visualization of Data using Topological Features

[Figure 4 axis label: Singular Value Index]

Figure 4: Normalized Spectrum of Topological Feature Matrix

As is typical when using an SVM, we scaled each coordinate such that the
values were between 0 and 1. The SVM was implemented using software
provided by [6].
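The scaling step can be sketched as a column-wise min-max rescaling. This
is an illustration only: the function name `minmax_scale` is hypothetical,
and mapping a constant column to 0 is a choice made here.

```python
def minmax_scale(rows):
    """Rescale every column of a feature matrix to the interval [0, 1].

    rows: list of equal-length lists, one feature vector per data point.
    A column whose values are all equal is mapped to 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

In practice the minima and maxima would be computed on the training data
and reused for the test data, so that both are scaled consistently.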

4.1.3 Classification Results 

We applied these methods on a subset of 1000 digits from the MNIST
database to tune parameters of the algorithm and test various kernels. For
the radial basis function $e^{-\gamma \|u - v\|^2}$ (RBF, also known as the
Gaussian kernel), we used $\gamma = 8$. For the polynomial kernel
$(\gamma (u \cdot v) + a)^d$, we used $d = 3$ with $\gamma = 2$ and
$a = 2$. In both functions, u and v represent the calculated feature
vectors. After this, we progressively increased the size of the subset
to 10,000 handwritten digits.
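For concreteness, the two kernels can be written out as plain functions
with the stated parameter values as defaults. This is a sketch: the squared
Euclidean norm in the RBF kernel is an assumption (it is the usual
convention for the Gaussian kernel), and the function names are
hypothetical.

```python
import math


def rbf_kernel(u, v, gamma=8.0):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(u, v, gamma=2.0, a=2.0, d=3):
    """Polynomial kernel (gamma * <u, v> + a)^d."""
    dot = sum(x * y for x, y in zip(u, v))
    return (gamma * dot + a) ** d
```

With features scaled to [0, 1], the RBF value is 1 for identical feature
vectors and decays quickly with distance when $\gamma = 8$.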

The classification accuracy was measured by partitioning the data set into 
one hundred subsets and using cross-validation successively on each subset. 
The results are shown in Table 1.

Table 1: Classification Accuracy of two SVM Kernels

SVM          1000 Digits   5000 Digits   10000 Digits
Gaussian     87.70%        91.54%        92.04%
Polynomial   88.00%        91.62%        92.10%


With the polynomial kernel on the 10,000 digit subset, an error of 7.9% is
seen. As mentioned above, the purpose of this test is not to outperform
existing classification algorithms but to demonstrate one application of
the topological features. In line with this, we examined some of the digits
that the algorithm failed on. Figure 5 shows a few of the typical problem
digits.

(a) Stylistic Problems 

(b) Spurious Topological Changes 
Figure 5: Common Misclassifications 

The most common confusion is between a '5' and a '2' written with no 
loop. Other confusions often occur between the shown style of '7' and slanted 
'3's and between a certain style of '4' and a '9'. These confusions are not 
unexpected since these numbers are topologically the same. The extra spatial 
information added by the directional sweeps is sensitive to variations in the 
slant or style of handwriting and a visual inspection of these digits suggests 
why the algorithm has difficulty classifying these particular examples. Other 
common confusions occur when topological changes occurred to the digit, 
specifically when the writer adds or removes a loop. 

4.2 Hepatic Lesion Classification

In this example, we apply topological features to classifying hepatic lesions. 
The dataset consists of computed tomography (CT) scans of 132 hepatic 
lesions that are outlined and annotated by radiologists. There are nine di- 
agnoses represented in the data: cysts (45 lesions), metastases (45 lesions), 
hemangiomas (18 lesions), hepatocellular carcinomas (HCC, 11 lesions), focal 
nodules (5 lesions), abscesses (3 lesions), neuroendocrine neoplasms (NeN, 3 
lesions), a single laceration and a single fat deposit. Additionally, there are 
no controls for the size of the lesion and the lesions vary from under 100 
pixels to 10,000 pixels. Because of the unbalanced nature of the data, we 
focus on the subset of cysts, metastases, and hemangiomas. 

Classification results using the barcode metric (matching metric) were
first presented in [2], and we follow the same methods for processing and
generating barcodes from the data. We will briefly describe the methods
here. For a more detailed account, please see [2].

4.2.1 Topological Methods 

As mentioned above, a natural filtration for an image is to filter by the pixel 
intensity. An example of this filtration is given in Figure 6. The variation
in pixel intensity allows us to use a one-dimensional filtration on the pixel 
intensity, but as the results will show, the classification is improved when 
geometric information is added into the filtrations. 

As there is no rotational orientation of the lesions, we cannot add in 
geometric information using the sweeps described in the previous section. 
Instead, we use the lesion border provided by the radiologist and assign 
each pixel its distance from the border. Then, by using two-dimensional
persistence (filtering by both intensity and distance from the border), we
achieve improved results, especially in the case of the hemangiomas, which
are characterized by large dense regions on the outer part of the lesion.
Because two-dimensional filtrations are computationally intensive, we
approximate the two-dimensional filtration with one-dimensional barcode
'slices' along the border filtration axis. We use 7 slices per lesion and
both the Betti 0 and Betti 1 barcodes.

Note that we can look at each filtration from each direction and catch
different features. The intensity filtration can add high intensity pixels
first or low intensity pixels first. The boundary filtration can begin with
pixels near the boundary first or pixels far from the boundary first. This
yields 56 one-dimensional barcodes per lesion (7 slices, 2 Betti numbers,
2 intensity directions and 2 boundary directions).
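The count of 56 barcodes, and the 224 features used later, can be checked
by enumerating the configurations. The labels below are hypothetical names
chosen for this illustration.

```python
from itertools import product

SLICES = range(7)                                 # slices along the border axis
BETTI = (0, 1)                                    # Betti 0 and Betti 1 barcodes
INTENSITY_ORDER = ("low_first", "high_first")     # intensity direction
BOUNDARY_ORDER = ("near_first", "far_first")      # boundary direction
FEATURES_PER_BARCODE = 4

# One configuration per one-dimensional barcode computed for a lesion.
configs = list(product(SLICES, BETTI, INTENSITY_ORDER, BOUNDARY_ORDER))

barcodes_per_lesion = len(configs)                                   # 7 * 2 * 2 * 2
features_per_lesion = barcodes_per_lesion * FEATURES_PER_BARCODE
```

Fixing an order for `configs` also fixes the layout of the feature vector,
which keeps the coordinates comparable across lesions.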




[Figure 6 panels: (a) Simple image with filtered complex; (b) β0 barcode
for above image; (c) β1 barcode for above image]

Figure 6: Constructing an increasing 1D-filtration on an image [2]

4.2.2 Feature Selection 

We use a slightly different set of four features as compared to the digits
example. These features are shown below. The same two kinds of features are
used here: features that focus on long bars and features which take shorter
bars into account. In this application, this is analogous to filtering the
barcode to remove the large number of smaller bars. Because of the
variations in lesion size, we look at the average over each bar to try to
eliminate the effects of large variations in lesion size.

As mentioned above, we have 56 barcodes per lesion. With four features,
this yields a feature vector of 224 features for each lesion.

4.2.3 Classification Results 

We apply the SVM using only the Gaussian kernel and use an exponential
parameter sweep to find optimal values of $\gamma$ for each method. We use
leave-one-out cross-validation (LOOCV) to calculate the classification
accuracies. The results are shown below. Table 2 gives the results for 1D
and 2D filtrations for several different datasets, while Table 3 shows how
well the algorithm performs on different lesion types for the different
filtrations. Table 4 demonstrates the effect of size on classification.

Table 2: SVM Classification Accuracies for 1D and 2D Filtrations

Filtration       Full     HcHeCM   HeCM     CM
1D (Intensity)   53.03%   59.66%   65.74%   75.56%
2D               67.42%   74.79%   81.48%   86.67%


Table 3: HeCM % Classification Accuracy by Lesion Type

Filtration   % of HeCM   % of Heman.   % of Cysts   % of Metas.
1D           65.74%      33.33%        75.56%       68.89%
2D           81.48%      61.11%        86.67%       84.44%

Table 4: Classification by Lesion Size of HeCM

Lesion Size by Area   % Accu.   # of Heman.   # of Cysts   # of Metas.
All                   81.48%    18            45           45
< 10000 px            82.52%    18            42           43
< 5000 px             84.78%    16            39           37
< 2500 px             86.25%    14            32           34
< 1250 px             88.514%   8             28           23

Using [2], we see that the topological features are comparable with using
the matching metric to generate features. The results from the HeCM dataset
for the two methods are shown below. They reflect the correct
classification of a single additional lesion using the topological
features, making the two methods virtually the same for this subset of the
data. Comparing with the other results in [2] shows that the two results
are very close in most categories, with each slightly outperforming the
other in certain subsets of the data.

Table 5: Classification Methods

Filtration   Barcode Features   Matching Metric
1D           65.74%             63.80%
2D           81.48%             80.56%



4.3 Discussion 

These two examples demonstrate the classifying power of topological
features when applied to real world datasets. This was done using
off-the-shelf machine learning algorithms, showing that these features can
easily be combined with more traditional classification methods and add a
set of additional classification features to the machine learning toolbox.

These examples also show the power of combining topology with geometry. In
both datasets, this is an integral part of the classification procedure.
The results in the hepatic lesion dataset provide an especially good
example of the potential gains that can be achieved by combining both
fields.

In summary, using algebraic geometry and invariant theory, we have iden- 
tified a family of coordinates on the space of finite metric spaces, or sampled 
shapes. These coordinates can serve as a method for organizing the collec- 
tion of all barcodes, and therefore any database whose members produce 
barcodes. Of course, we can also use various metrics on barcode space, such 
as the bottleneck or Wasserstein distances. It would be extremely interesting 
to analyze the relationship between these distances on barcode spaces with 
various more algebraic notions of distance on the barcode coordinates. It 
would also be very interesting to define and analyze analogous coordinates 
on spaces of multidimensional persistence modules, where they might give 
information which is currently not accessible due to the complexity of the 
algebraic descriptions of multidimensional persistence modules. 